Experience In WordNet Sense Tagging In The Wall Street Journal

Authors

  • Janyce Wiebe
  • Julie Maples
  • Lei Duan
  • Rebecca F. Bruce
Abstract

This paper reports on our experience hand tagging the senses of 25 of the most frequent verbs in 12,925 sentences of the Wall Street Journal Treebank corpus (Marcus et al. 1993). The verbs are tagged with respect to senses in WordNet (Miller 1990). Some of the annotated verbs can function as both main and auxiliary verbs, and some are often used in idioms. This paper suggests consistently representing these as separate subclasses. Strategies from the coding instructions for recognizing idioms are described, as well as some challenging ambiguities found in the data.

1 Introduction

This paper reports on our experience hand tagging the senses of 25 of the most frequent verbs in 12,925 sentences of the Wall Street Journal Treebank corpus (Marcus et al. 1993). The purpose of this work is to support related work in automatic word-sense disambiguation. The verbs are tagged with respect to senses in WordNet (Miller 1990), which has become widely used, for example in corpus-annotation projects (Miller et al. 1994, Ng & Hian 1996, and Grishman et al. 1994) and for performing disambiguation (Resnik 1995 and Leacock et al. 1993).

The verbs to tag were chosen on the basis of how frequently they occur in the text, how wide their range of senses is, and how distinguishable the senses are from one another. In related work, we have begun to tag nouns and adjectives as well. These are being chosen additionally on the basis of co-occurrence with the verbs already tagged, to support approaches such as that of Hirst (1987), in which word-sense ambiguities are resolved with respect to one another.

Some of the chosen verbs can function as both main and auxiliary verbs, and some are often used in idioms. In this paper, we suggest consistently representing these as separate subclasses.

We apply a preprocessor to the data, which automatically identifies some classes of verb occurrence with good accuracy. This facilitates manual annotation, because it is easier to fix a moderate number of errors than to tag the verbs completely from scratch. The preprocessor also performs other miscellaneous tasks to aid in the tagging task, such as separating out punctuation marks and contractions.

At the end of the paper, we share some strategies from our coding instructions for recognizing idioms, and show some challenging ambiguities we found in the data.

2 The Verbs and the Basic Tag Format

The following are the verbs that were tagged. The total number of occurrences is 6,197.

VERB      NUMBER    VERB      NUMBER
have      2740      make      473
take      316       get       231
add       118       pay       189
see       159       call      151
decline   84        hold      127
come      191       give      168
keep      101       know      87
find      130       lose      82
believe   103       raise     124
drop      61        lead      105
work      101       leave     81
run       105       look      95
meet      75

The basic tags have the following form. Extensions will be given below. word_
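The introduction mentions that the preprocessor separates out punctuation marks and contractions before manual tagging. The paper's actual rules are not given in this excerpt, so the following is only a minimal sketch of what such a step might look like. The function name `preprocess`, the list of clitics, and the Treebank-style handling of "n't" are all assumptions made for illustration.

```python
import re
from typing import List


def preprocess(sentence: str) -> List[str]:
    """Split a raw sentence into tokens, separating contractions and punctuation.

    Hypothetical rules, not the authors' preprocessor.
    """
    # Split "n't" off its host word: "doesn't" -> "does n't"
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", sentence)
    # Split other common clitics: "it's" -> "it 's" (assumed clitic list)
    text = re.sub(r"(?i)(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # Surround every punctuation mark except the apostrophe with spaces
    text = re.sub(r"([^\w\s'])", r" \1 ", text)
    return text.split()


if __name__ == "__main__":
    print(preprocess("The company doesn't have a plan, but it's working."))
```

This sketch deliberately ignores harder cases such as decimal numbers, hyphenated words, and abbreviations, which a real preprocessor for newswire text would need to handle.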




Publication date: 1997